Network motifs are small building blocks of complex networks, such as gene regulatory networks. 
The frequent appearance of a motif may be an indication of some network-specific utility for that 
motif, such as speeding up the response times of gene circuits. However, the precise nature of 
the connection between motifs and the global structure and function of networks remains unclear. 
Here we show that the global structure of some real networks is statistically determined by the 
distributions of local motifs of size at most 3, once we augment motifs to include node degree 
information. That is, remarkably, the global properties of these networks are fixed by the probability 
of the presence of links between node triples, once this probability accounts for the degree of the 
individual nodes. We consider a social web of trust, protein interactions, scientific collaborations, 
air transportation, the Internet, and a power grid. In all cases except the power grid, random 
networks that maintain the degree-enriched connectivity profiles for node triples in the original 
network reproduce all its local and global properties. This finding provides an alternative statistical 
explanation for motif significance. It also impacts research on network topology modeling and 
generation. Such models and generators are guaranteed to reproduce essential local and global 
network properties as soon as they reproduce their 3-node connectivity statistics. 



I. INTRODUCTION 

A promising direction in the studies of the structure 
and function of complex networks is to identify their 
building blocks, or motifs [1-3], which are small sub- 
graphs in a real network. A great deal of research — in 
particular, research on gene regulatory networks — shows 
that specific motifs perform specific functions, such as 
speeding up response times of regulatory networks [4, 5]. 
However, motifs have also raised many questions [6-13], 
including continuing debates on whether and how motif 
statistical profiles are related to the global structure, 
function, and evolution of certain networks. 

Our recent work [14] introduces the dK-series, see Section II. The dK-series, by analogy with the Taylor or Fourier series, is the first systematic and complete basis for characterizing network structure. The dK-series is a generalization of known degree-based statistical characteristics of complex networks. The zeroth element of the dK-series, the 0K-"distribution," is the average degree in a given network. The first element, the 1K-distribution, is the network's degree distribution, or the number of nodes (subgraphs of size 1) of degree k. The second element, the 2K-distribution, is the joint degree distribution, the number of subgraphs of size 2 (links) involving nodes of degrees $k_1$ and $k_2$. For d = 3, the subgraphs are triangles and wedges, composed of nodes of degrees $k_1$, $k_2$, and $k_3$. Generalizing, the dK-distribution is the numbers of different subgraphs of size d involving nodes of degrees $k_1, k_2, \ldots, k_d$. 



The dK-series is systematic and complete because it is inclusive and converging. Inclusiveness results from the fact that the (d+1)K-distribution contains the same information about the network as the dK-distribution, plus some additional information. That is, by increasing d, we provide increasingly more detail about the network structure. As d increases toward the network size, we fully specify the entire network structure, which explains the second property of the dK-series, convergence: it converges to the given network in the limit of large d. 

Does this convergence happen only at d equal to the 
network size, or much sooner, at smaller d? In other 
words, how much local information, i.e., information 
about concentrations of degree-labeled subgraphs of what 
size, is needed to fully capture global network structure? 

To answer these questions, we must compare a real network with typical random networks defined by its dK-distribution. If there is no difference between such dK-random networks and the real network, then the latter is fixed by its dK-distribution. To obtain a dK-random version of the real network, we dK-randomize it as illustrated in Fig. 1(a): we randomly rewire (pairs of) links preserving the dK-distribution in the network, generalizing known network randomization techniques [17, 18] used to compute motif statistical significance. The results of this dK-randomization procedure are random networks that have the same dK-distribution as the original real network, but that are maximally random in all other respects. 

Our question thus becomes: what is the minimum value of d such that there is no difference between a real network and its dK-randomizations? It seems at first that the answer to this question should strongly depend on the specific networks we consider. 

FIG. 1: The dK-randomization null models for d = 0, 1, 2, 3. a) Illustration of dK-randomizing rewiring. All nodes are labeled by their degrees, and a dK-rewiring preserves the graph's dK-distribution, and consequently its d'K-distributions for all d' < d, but randomizes the d''K-distributions for d'' > d. The 0K-randomization involves rewiring of a link to any pair of disconnected nodes, which preserves the average degree only. The 1K-randomization preserves the degree distribution, too, by rewiring a pair of links as shown. The 2K-randomization preserves the joint degree distribution as well, because at least two nodes adjacent to the rewired links are of the same degree. The 3K-randomization preserves the number of degree-labeled wedges and triangles. As d increases, the rewiring becomes increasingly more constrained since fewer links can be rewired without altering the dK-distribution. To dK-randomize a network, we randomly select a pair of links, and rewire them if they can be dK-rewired, or, if they cannot be rewired, select another random pair. This process is repeated for a sufficient number of successful rewirings, i.e., until all network properties stop changing, at which point we say that the graph has converged to its dK-randomization. b) Visualization of the social web of trust (PGP network [15]) and its dK-randomizations. We use the LaNet-vi tool [16] for visualization, which encodes the node coreness in color, see the right legends. The coreness is a measure of node centrality, i.e., how deeply in the network core the node lies [16]. Nodes with larger coreness are also placed closer to the circle centers. The quick convergence of the dK-randomizations to the original PGP network, and the similarity between it and its 3K-randomization are remarkable. 
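
The rewiring loop described in the Fig. 1 caption can be sketched in a few lines for the d = 1 case. The sketch below is a minimal illustration, not the authors' implementation: the function names `degree_sequence` and `one_k_rewire`, their parameters, and the ring-graph usage are all assumptions made here. It picks random pairs of links, swaps their endpoints when the swap creates no self-loop or duplicate link, and thereby keeps every node degree fixed. For d = 2 or d = 3 one would add a further acceptance check that the joint degree distribution, or the degree-labeled wedge and triangle counts, stay unchanged.

```python
import random
from collections import Counter


def degree_sequence(edges):
    """Return a Counter mapping each node to its degree."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def one_k_rewire(edges, n_swaps, seed=0, max_tries=100_000):
    """Hypothetical 1K-preserving randomization by double edge swaps.

    Assumes a simple undirected graph given as a list of (u, v) pairs.
    Stops after n_swaps successful swaps or max_tries attempts.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both ways of pairing the endpoints
        # Proposed swap: (a, b), (c, d) -> (a, d), (c, b).
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue  # would create a duplicate link
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i] = (a, d)
        edge_list[j] = (c, b)
        done += 1
    return edge_list


# Usage: randomize a 10-node ring; every node keeps degree 2.
ring = [(i, (i + 1) % 10) for i in range(10)]
rewired = one_k_rewire(ring, n_swaps=20, seed=1)
```

Each accepted swap removes one link end and adds one link end at each of the four nodes involved, which is why the degree sequence cannot change.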

We consider a variety of social, biological, transporta- 
tion, communication, and technological networks, see 
Section III. Although the dK-series applies to directed 
and even annotated networks [19], here we report results 
for undirected networks. The dK-distributions for directed 
or annotated networks contain more information 
than for undirected networks. Therefore, the dK-series 
converges faster in the former case [19]. Below we show 
the results for the well-studied social web of trust re- 
lationships extracted from Pretty Good Privacy (PGP) 
data [15]. The results for all other networks, except the 
power grid, are similar, cf. Section IV, where we also dis- 
cuss possible reasons for why the power grid appears as 
an exception. 



Fig. 1(b) visualizes the PGP network and its dK-randomizations. We observe that the dK-series converges at d = 3. While the 0K-random network has little in common with the real network, the 1K-random one is somewhat more similar, even more so for 2K, and there is very little difference between the real PGP network and its 3K-random counterpart. 

To provide a more detailed and insightful comparison between the real network and its dK-randomizations, we compute a variety of metrics for each. Some popular metrics, such as degree distribution, average nearest neighbor connectivity, clustering, etc., are functions, sometimes peculiar, of dK-distributions, and therefore it is not surprising that they are properly captured by the dK-series, as confirmed in Section IV A. We classify metrics that do not explicitly depend on dK-distributions as microscopic, mesoscopic, and macroscopic. We choose them to probe the network structure at the local, medium, and global scales. 

FIG. 2: Microscopic scale: motifs. There are six different graphs of size 4 shown on the x-axes. The top plot shows the distribution of the numbers of these subgraphs in the PGP network and its dK-randomizations, d = 0, 1, 2, 3. Each blue bar, for example, is the number of the corresponding subgraph occurrences in the PGP network divided by the total number of subgraphs of size 4 in it. For dK-randomizations, the values are averaged, for each d, over several realizations of the dK-randomized network. In the case of 0K-randomization, the last two motifs did not occur in any randomized sample of the network. The bottom plot shows the Z-scores for the six subgraphs in the four dK-randomization null models. The Z-score [1] of a subgraph is a measure of its statistical significance in a real network, compared to a randomization null model. Specifically, the Z-score Z is the difference between the number $N$ of the occurrences of a subgraph in the real network and the average number $\bar{N}$ of its occurrences in the corresponding randomized networks, divided by the standard deviation $\sigma$ of its occurrences in the randomized networks, $Z = |N - \bar{N}|/\sigma$. 
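
The Z-score defined in the Fig. 2 caption is simple enough to state as code. The following is a minimal sketch under stated assumptions: the function name `z_score` and the sample counts are hypothetical, and the use of the population standard deviation is a choice made here, since the text does not say whether the population or the sample estimator is used.

```python
from statistics import mean, pstdev


def z_score(real_count, randomized_counts):
    """Z = |N - mean(N_rand)| / std(N_rand) for one subgraph type.

    Returns None when the randomized counts have zero spread,
    in which case the Z-score is undefined.
    """
    avg = mean(randomized_counts)
    sigma = pstdev(randomized_counts)  # assumption: population std
    if sigma == 0:
        return None
    return abs(real_count - avg) / sigma


# Usage with made-up counts of one motif in a real network
# and in four randomized samples of it.
z = z_score(50, [10, 12, 8, 10])
```

A motif whose count in the real network sits well within the spread of the randomized counts gets a Z-score near zero, which is the sense in which the text calls a motif statistically non-significant.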

The simplest microscopic, local-structure statistics, which are not fixed by the dK-distributions with d ≤ 3, are the frequencies of motifs of size 4 without degree information. We compute these frequencies in the real network and its dK-randomizations, and show the results in Fig. 2. We find that the (relative) statistical significance of the motifs strongly depends on d. More importantly, no motif is statistically significant for d = 3. 

FIG. 3: Mesoscopic scale: community structure. We compute communities in the PGP network using the Extremal Optimization algorithm [20]. We then sort the found communities in the order of decreasing size. The size of a community is the number of nodes in it. The rank of a community is its position number in the size-ordered list. We then show the community size distribution by plotting the community sizes vs. their ranks. 

At the mesoscopic scale, we consider the community 
structure of the PGP network. A community is a sub- 
graph with many internal connections, and a relatively 
small number of connections external to the subgraph. 
Fig. 3 shows that the community structure is indeed a 
"mesoscopic" metric because the community sizes range 
from a few nodes to thousands of nodes for the largest 
communities. Fig. 3 also shows that the community size 
distributions in the PGP network and its 3K-randomization are 
very similar. 

At the macroscopic scale, we consider the two most 
popular and important statistics that depend on a net- 
work's global structure: the node betweenness central- 
ity and the distribution of lengths of shortest paths in a 
network. Fig. 4 once again shows that 3K is sufficient 
to capture even global graph properties; the considered 
metrics are approximately the same for the PGP network 
and its 3K-randomization. 

We call a given real network dK-random if all its metrics, at all scales from local to global, are approximately the same as the corresponding metrics in its dK-randomizations. We see in Section IV that, in agreement with the results of Vazquez et al. [12], almost all networks that we collected data for are 3K-random at most (some networks are 2K- or even 1K-random). That is, surprisingly, the global structure of these networks is captured entirely by the distribution of node triples and their degrees. 

FIG. 4: Macroscopic scale: the distance and betweenness distributions. The top plot shows the metrics related to the hop length of shortest paths, or distances, between nodes in the PGP network and its dK-randomizations. These metrics are the average and maximum distance between nodes, the latter called the network diameter, and the standard deviation of the distance distribution. The bottom plot shows the average betweenness and the standard deviation of the betweenness distribution of nodes in the PGP network and its dK-randomizations. The betweenness of a node is a measure of its communication centrality [21]. It is equal to the number of shortest paths passing through the node, divided by the total number of shortest paths between the same source and destination, summed over all source-destination pairs. In both plots the values for dK-randomizations are averaged, for each d, over several realizations of the dK-randomized network. 

It is an open question why many different real networks are 3K-random. A trivial answer would be that d = 3 is just "constraining enough." There may only be a few possible rewirings preserving the 3K-distribution. But why exactly is d = 3 sufficient for real networks? There are many classes of synthetic graphs, such as lattices, for which no d substantially smaller than the graph size is "constraining enough." Perhaps the answer can be obtained by studying the hidden metric spaces underlying real networks [22]. The distances in such spaces abstract intrinsic similarities between nodes. If these spaces are metric, and there is empirical evidence that they are indeed such [23], then the triangle inequality naturally yields and explains network clustering, which the 3K-distribution captures by definition. 

Whatever the actual explanation, our results have diverse implications. First, our dK-randomization basis makes it clear that there is no preferred null model for network randomization. To tell how statistically important a given motif is, it is necessary to compare its frequency in the real network with the same frequency in a network randomization, a null model. But one can dK-randomize any network for any d. Therefore choosing any specific value of d, or more generally, any specific null model to compute motif significance requires some non-trivial justification. 

The second implication concerns the difference between motifs and the dK-series. This difference is small but crucial. Motifs are subgraphs whose nodes can have any degree in the original network, while the dK-series preserves the information about these degrees. This difference is crucial because a motif-based series cannot be inclusive. Node degrees are necessary to make the series inclusive and thus systematic, see Section V. 

Our finding that many networks are 3K-random can assist our understanding of how functions of an evolving network shape its structure. Indeed, one can potentially simplify such explanations to how the observed 3K-distribution has emerged in the network. As soon as one explains the emergence of the 3K-distribution, all other network structural properties follow. 

Finally, our work very practically impacts the design of network topology models and generators. For simulation experiments, hypothesis testing, etc., network researchers in many sciences, including biology [9, 24-26] and computer science [27-30], must model real networks in laboratory settings, and generate random graphs that reproduce important properties of the real network. Our results show that it is sufficient to generate 3K-random graphs for such purposes. But even if these graphs do not capture some important property not previously considered, the dK-series will remain applicable given its convergence property and a sufficient increase in d. 

We conclude this introduction with a reference to [19] for a detailed discussion of various graph generation techniques based on the dK-series and extensions to generate random graphs with rich semantic, structural, or functional annotations of nodes and links. 


II. THE dK-SERIES ILLUSTRATED 

In Fig. 5(a) we illustrate the dK-series for a graph of size 4. The 4K-distribution is the graph itself. The 3K-distribution consists of its three subgraphs of size 3: one triangle connecting nodes of degrees 2, 2, and 3, and two wedges connecting nodes of degrees 2, 3, and 1. The 2K-distribution is the joint degree distribution in the graph. It specifies the number of links (subgraphs of size 2) connecting nodes of different degrees: one link connects nodes of degrees 2 and 2, two links connect nodes of degrees 2 and 3, and one link connects nodes of degrees 3 and 1. The 1K-distribution is the degree distribution in the graph. It lists the number of nodes (subgraphs of size 1) of different degree: one node of degree 1, two nodes of degree 2, and one node of degree 3. The 0K-distribution is just the average degree in the graph, which is 2. 

FIG. 5: The dK-series illustrated: a) dK-distributions for a graph of size 4; b) convergence and inclusiveness of the dK-series. 
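
The counts listed for this four-node example can be reproduced mechanically. The sketch below is only an illustration under stated assumptions: the function name `dk_distributions`, the node labels, and the choice to key subgraphs by sorted degree tuples (with the wedge center in the middle position) are all made up here. The edge list is one graph consistent with the description, a triangle with a pendant node attached.

```python
from collections import Counter
from itertools import combinations


def dk_distributions(edges):
    """Return (0K, 1K, 2K, 3K wedges, 3K triangles) as raw counts.

    Assumes a simple undirected graph given as unique (u, v) pairs
    with mutually comparable node labels.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    deg = {node: len(nbrs) for node, nbrs in adj.items()}

    zero_k = sum(deg.values()) / len(deg)  # average degree
    one_k = Counter(deg.values())          # nodes per degree
    two_k = Counter(tuple(sorted((deg[u], deg[v]))) for u, v in edges)

    wedges, tri_seen = Counter(), Counter()
    for center, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                # Closed triple: each triangle is seen once per corner.
                key = tuple(sorted((deg[a], deg[b], deg[center])))
                tri_seen[key] += 1
            else:
                lo, hi = sorted((deg[a], deg[b]))
                wedges[(lo, deg[center], hi)] += 1
    triangles = Counter({k: n // 3 for k, n in tri_seen.items()})
    return zero_k, one_k, two_k, wedges, triangles


# Triangle a-b-c with pendant node d attached to c.
EXAMPLE_EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
zero_k, one_k, two_k, wedges, triangles = dk_distributions(EXAMPLE_EDGES)
```

For this graph the function returns an average degree of 2, one triangle with degrees (2, 2, 3), and two wedges whose center has degree 3 and whose ends have degrees 1 and 2, matching the text.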

Fig. 5(b) illustrates the inclusiveness and convergence of the dK-series by showing the hierarchy of dK-graphs, which are graphs that have the same dK-distribution as some graph G of size n. The black circles schematically show the sets of dK-graphs. 

The set of 0K-graphs is largest: the number of different graphs that have the same average degree as G is enormous. These graphs may have a structure drastically different from G's. The set of 1K-graphs is a subset of 0K-graphs, because each graph with the same degree distribution as in G has also the same average degree as G, but not vice versa. As a consequence, typical ("maximally random") 1K-graphs tend to be more similar to G than 0K-graphs. The set of 2K-graphs is a subset of 1K-graphs, also containing G. 

As d increases, the circles become smaller because the number of different dK-graphs decreases. Since all the dK-graph sets contain G, the circles "zoom in" on it, and while their number decreases, dK-graphs become increasingly more similar to G. In the d = n limit, the set of nK-graphs consists of only one element, G itself. 



III. THE REAL NETWORKS CONSIDERED 

We collected data for a number of real networks. We 
wanted the set of considered networks to be representa- 
tive, in the sense that it should contain networks of differ- 
ent nature, coming from different domains, thus showing 
the universality of our dK-basis. The considered networks 
include social, biological, transportation, and technological 
networks. Specifically, we report results for: 

• The social web of trust relationships among people. 
The trust relationships are inferred using the data 
from the Pretty Good Privacy (PGP) encryption 
algorithm [15]. We extract the strongly connected 
component from this network. The nodes are peo- 
ple, and there is a link between two people if they 
trust each other. 

• The social network of scientific collaborations extracted from the arXiv condensed-matter database [31]. The nodes are authors, and there is a link between two authors if they co-authored a paper. 

• The biological network of protein interactions in the yeast Saccharomyces cerevisiae collected from the database of interacting proteins [32]. The nodes are proteins, and there is a link between two proteins if they interact. 

• The US air transportation network [33]. The nodes are airports, and there is a link between two airports if there is a direct flight between them. 

• The topology of the Internet at the level of Autonomous Systems (ASs) [34]. The nodes are ASs, i.e., organizations owning parts of the Internet infrastructure, and there is a link between two ASs if they are physically connected. 

• The electrical power grid in the western US [35]. The nodes are generators, transformers, or substations, two of which are linked if there is a high-voltage transmission line between them. 

Table I lists these networks and their abbreviations used in the subsequent figures and tables. 

TABLE I: The considered networks and their abbreviations. 

Network                                 Abbreviation 
PGP Web of Trust [15]                   PGP 
Scientific collaboration network [31]   Collab. 
Protein interaction network [32]        Protein 
US air transportation network [33]      Air 
Internet at the level of ASs [34]       Internet 
Power grid in the western US [35]       Power 


IV. TOPOLOGIES OF REAL NETWORKS AND THEIR dK-RANDOMIZATIONS 

In this section we compare the real networks to their dK-randomizations across a number of topological metrics. 

A. Metrics defined by dK-distributions 

We first consider the most basic metrics, which are defined by the appropriate dK-distributions. Therefore it is not surprising that dK-random graphs with appropriate d have the values of these metrics equal exactly to those in the real networks. Nevertheless, we report these results for consistency and illustration purposes. 

1. 1K: degree distribution 

Fig. 6 shows the distributions P(k) of node degrees k: 

$$P(k) = \frac{N(k)}{N}, \tag{1}$$

where N(k) is the number of nodes of degree k in the network, and N is the total number of nodes in it, so that P(k) is normalized, $\sum_k P(k) = 1$ (we do not consider nodes of degree k = 0). The 1K-distribution fully defines the 0K-distribution, i.e., the average degree $\bar{k}$ in the network, by 

$$\bar{k} = \sum_k k P(k), \tag{2}$$

but not vice versa. 

We observe in Fig. 6 that while 0K-randomizations are off, the 1K-random graphs reproduce the degree distributions in the real networks exactly, which is by definition: the 1K-distribution is the degree distribution, and 1K-randomization does not alter it. The dK-randomizations with d > 1 do not alter the 1K-distribution either, therefore they also match the degree distributions in the real networks exactly (not shown). 

FIG. 6: The degree distribution in the real networks and their dK-randomizations. 

2. 2K: average neighbor degree 

Fig. 7 shows the average degree $\bar{k}_{nn}(k)$ of neighbors of nodes of degree k. This function is a commonly used projection of the joint degree distribution (JDD) P(k, k'), i.e., the 2K-distribution. The JDD is defined as 

$$P(k,k') = \mu(k,k') \frac{N(k,k')}{2M}, \tag{3}$$

where N(k, k') = N(k', k) is the number of links between nodes of degrees k and k' in the network, M is the total number of links in it, and 

$$\mu(k,k') = \begin{cases} 2 & \text{if } k = k', \\ 1 & \text{otherwise,} \end{cases} \tag{4}$$

so that P(k, k') is normalized, $\sum_{k,k'} P(k,k') = 1$. The 2K-distribution fully defines the 1K-distribution by 

$$P(k) = \frac{\bar{k}}{k} \sum_{k'} P(k,k'), \tag{5}$$

but not vice versa. The average neighbor degree $\bar{k}_{nn}(k)$ is a projection of the 2K-distribution P(k, k') via 

$$\bar{k}_{nn}(k) = \frac{\bar{k}}{k P(k)} \sum_{k'} k' P(k,k'). \tag{6}$$

We observe in Fig. 7 that while 0K-randomizations are way off, the 1K-randomizations are much closer to the real networks, whereas the 2K-randomizations have exactly the same average neighbor degrees as the real networks, which is again by definition: 2K-randomization does not change P(k, k'). In the Internet case, even 1K-randomization does not noticeably affect $\bar{k}_{nn}(k)$. The dK-randomizations with d > 2 do not alter P(k, k') and consequently $\bar{k}_{nn}(k)$ at all, therefore they reproduce the latter exactly as well for all the networks (not shown). 

FIG. 7: The average neighbor degree in the real networks and their dK-randomizations. 

FIG. 8: The degree-dependent clustering in the real networks and their dK-randomizations. 
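
As a small worked illustration of the 1K and 2K definitions above, the following sketch computes P(k), the joint degree distribution P(k, k'), and the average neighbor degree for a toy graph. It is a hedged example, not the authors' code: the function name `degree_stats`, the toy edge list, and the dictionary-based representation are assumptions made here. The average neighbor degree is obtained as the P(k, k')-weighted mean of k', rescaled by the average degree over k P(k).

```python
from collections import Counter, defaultdict


def degree_stats(edges):
    """Return P(k), average degree, P(k, k'), and knn(k) as dicts.

    Assumes a simple undirected graph with no isolated nodes,
    given as a list of unique (u, v) pairs.
    """
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    n, m = len(deg), len(edges)

    n_k = Counter(deg.values())
    p_k = {k: count / n for k, count in n_k.items()}
    k_bar = sum(k * p for k, p in p_k.items())

    # Adding both orientations of every link with weight 1/(2M)
    # reproduces the factor mu(k, k') = 2 on the diagonal k = k'.
    p_kk = defaultdict(float)
    for u, v in edges:
        p_kk[(deg[u], deg[v])] += 1 / (2 * m)
        p_kk[(deg[v], deg[u])] += 1 / (2 * m)

    knn = {}
    for k in p_k:
        weighted = sum(k2 * p for (k1, k2), p in p_kk.items() if k1 == k)
        knn[k] = k_bar / (k * p_k[k]) * weighted
    return p_k, k_bar, dict(p_kk), knn


# Toy graph: triangle a-b-c with pendant node d attached to c.
TOY_EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
p_k, k_bar, p_kk, knn = degree_stats(TOY_EDGES)
```

On this toy graph the degree-2 nodes have neighbors of degrees 2 and 3, so their average neighbor degree is 2.5, and the single degree-1 node sees only the degree-3 hub.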



3. 3K: clustering 

Fig. 8 shows degree-dependent clustering $\bar{c}(k)$. Clustering of node i is the number of triangles $\Delta_i$ it forms, or equivalently the number of links among its neighbors, divided by the maximum such number, which is k(k-1)/2, where k is i's degree, deg(i) = k. Averaging over all nodes of degree k, the degree-dependent clustering is 

$$\bar{c}(k) = \frac{2\Delta(k)}{k(k-1)N(k)}, \quad \text{where} \quad \Delta(k) = \sum_{i:\,\deg(i)=k} \Delta_i. \tag{7}$$

The degree-dependent clustering is a commonly used projection of the 3K-distribution [38]. The 3K-distribution is actually two distributions characterizing the concentrations of the two non-isomorphic degree-labeled subgraphs of size 3, wedges and triangles. 

Let $N_\wedge(k',k,k'') = N_\wedge(k'',k,k')$ be the number of wedges involving nodes of degrees k, k', and k'', where k is the central node degree, and let $N_\triangle(k,k',k'')$ be the number of triangles consisting of nodes of degrees k, k', and k'', where $N_\triangle(k,k',k'')$ is assumed to be symmetric with respect to all permutations of its arguments. Then the two components of the 3K-distribution are 

$$P_\wedge(k',k,k'') = \mu(k',k'') \frac{N_\wedge(k',k,k'')}{2W}, \tag{8}$$

$$P_\triangle(k,k',k'') = \nu(k,k',k'') \frac{N_\triangle(k,k',k'')}{6T}, \tag{9}$$

where T and W are the total numbers of triangles and wedges in the network, and 

$$\nu(k,k',k'') = \begin{cases} 6 & \text{if } k = k' = k'', \\ 1 & \text{if } k \neq k' \neq k'' \text{ (all three different)}, \\ 2 & \text{otherwise,} \end{cases} \tag{10}$$

so that both $P_\wedge(k',k,k'')$ and $P_\triangle(k,k',k'')$ are normalized, $\sum_{k,k',k''} P_\wedge(k',k,k'') = \sum_{k,k',k''} P_\triangle(k,k',k'') = 1$. The 3K-distribution defines the 2K-distribution (but not vice versa), by 

$$P(k,k') = \frac{1}{k+k'-2} \sum_{k''} \left\{ \frac{6T}{M} P_\triangle(k,k',k'') + \frac{W}{M} \left[ P_\wedge(k',k,k'') + P_\wedge(k,k',k'') \right] \right\}. \tag{11}$$

The normalization of the 2K- and 3K-distributions implies the following identity between the numbers of triangles, wedges, edges, nodes, and the second moment of the degree distribution $\overline{k^2} = \sum_k k^2 P(k)$: 

$$\overline{k^2} = 2\,\frac{3T + W + M}{N}. \tag{12}$$

The degree-dependent clustering coefficient $\bar{c}(k)$ is the following projection of the 3K-distribution: 

$$\bar{c}(k) = \frac{6T \sum_{k',k''} P_\triangle(k,k',k'')}{N\,k(k-1)\,P(k)}. \tag{13}$$

We observe in Fig. 8 that clustering in the real networks and their dK-randomizations with d = 3 is exactly the same, which is again by definition. For d < 3, clustering differs drastically in many cases, except for the air transportation network and especially the Internet. Therefore we can say that the Internet is very close to being 1K-random, i.e., fully defined by its degree distribution, as far as the dK-based metrics are concerned. Neither 3K-, 2K-, nor even 1K-randomization alters its dK-based (projection) metrics noticeably. 
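
The degree-dependent clustering and the counting identity relating triangles, wedges, links, nodes, and the second moment of the degree distribution can be checked numerically on a small graph. The sketch below is illustrative only: the function name `clustering_by_degree` and the toy graph are assumptions made here, and wedges are taken to be open triples, i.e., paths of length two whose end nodes are not linked.

```python
from collections import Counter, defaultdict
from itertools import combinations


def clustering_by_degree(edges):
    """Return (c(k) per degree k > 1, triangle count T, wedge count W).

    Assumes a simple undirected graph given as unique (u, v) pairs.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {i: len(nbrs) for i, nbrs in adj.items()}

    # Triangles through each node = links among its neighbors.
    tri_at = {
        i: sum(1 for a, b in combinations(adj[i], 2) if b in adj[a])
        for i in adj
    }
    n_k = Counter(deg.values())
    delta_k = Counter()
    for i, t in tri_at.items():
        delta_k[deg[i]] += t

    c_k = {
        k: 2 * delta_k[k] / (k * (k - 1) * n_k[k])
        for k in n_k
        if k > 1  # clustering is undefined for degree-1 nodes
    }
    total_triangles = sum(tri_at.values()) // 3
    # All centered triples minus the closed ones (3 per triangle).
    total_wedges = sum(k * (k - 1) // 2 for k in deg.values()) - 3 * total_triangles
    return c_k, total_triangles, total_wedges


# Toy graph: triangle a-b-c with pendant node d attached to c.
TOY_EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
c_k, T, W = clustering_by_degree(TOY_EDGES)

# Identity check: mean of k^2 equals 2 (3T + W + M) / N.
degrees = [2, 2, 3, 1]
second_moment = sum(k * k for k in degrees) / len(degrees)
identity_rhs = 2 * (3 * T + W + len(TOY_EDGES)) / len(degrees)
```

Here both sides of the identity evaluate to 4.5, with T = 1, W = 2, M = 4, and N = 4.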



B. Motifs and their Z-scores 

There are six non-isomorphic motifs of size 4, shown on the x-axes of Figs. 9 and 10. For each network and for each d = 0, 1, 2, 3, we obtain several dK-randomized samples of the network, and then for each motif we compute its distribution (normalized to the total number of subgraphs of size 4) in the real network, and its average distribution in the dK-randomized samples of the network. The results are in Fig. 9. Fig. 10 reports the corresponding Z-scores. In certain cases, often for 0K-randomizations, some motifs do not occur at all in any randomized samples, which explains the absence of some bars in the figures. 

The key observation is that when the randomization null model is 3K, the distributions of all motifs in the randomizations of all the networks except the power grid are close to those in the real networks. The corresponding Z-scores are either low or zero. In other words, all motifs are statistically non-significant. 



C. Distance and betweenness distributions 

Fig. 1 1 shows the distance distribution in the real net- 
works and in their dX-randomizations. The distance dis- 
tribution is the distribution of hop-lengths of shortest 
paths between nodes in a network. Formally, if N(h) is 
the number of node pairs located at hop distance h from 
each other, then the distance distribution 5{h) is 



5{h) = 



2N{h) 

N{N -ly 



(14) 



10 




FIG. 9: The motif distributions in the real networks and their d-ftT-randomizations. 





where N{N — l)/2 is the total number of nodes pairs in 
the network. 



To provide a clearer view of how close the distance dis- 
tributions in dif-randomizations are to the real networks, 
we show in Fig. 12 some scalar summary statistics of the 
distance distribution as functions of d. These summary 



statistics are the average distance 

/i = ^M(/i), (15) 

h 

and the standard deviation of the distance distribution 
d{h). In addition we show in Fig. 12 the network diame- 
ter, i.e., the maximum hop- wise distance between nodes 
in the network, which is an extremal statistics of the dis- 
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FIG. 12: The average distance, the standard deviation of the distance distribution, and the network diameter as functions of 
d for dTf-randomisations of the real networks. The corresponding values for the real networks are shown by dashed lines. 
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FIG. 13: The average betweenness of nodes of a given degree in the real networks and their dif-randomizations. 






FIG. 14: The average betweenness and the standard deviation of the betweenness distribution as functions of d for dK- 
randomisations of the real networks. The corresponding values for the real networks are shown by dashed lines. 
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TABLE II: The scalar topological metrics of the real networks 
and the minimum value of d such that the network's dK- 
randomizations approximately preserve all the metrics. 



Metrics 


PGP 


CoUab. 


Protein 


Air 


Internet 


Power 


k 


4.6 


6.4 


6.4 


11.9 


6.3 


4.7 


r 


0.238 


0.157 


-0.137 


-0.268 


-0.236 


-0.273 


c 


0.27 


0.65 


0.09 


0.62 


0.46 


0.68 


li 


7.5 


().() 


i.2 


:',.() 


:i.l 


2.0 


b 


6- 10"" 


4- 10"" 


7 ■ 10"" 


4- 10"^ 


2 • 10"" 


2- 10"" 


dK 


ZK 






2K 


IK 


? 



tance distribution.

Fig. 13 shows degree-dependent betweenness centrality
$\bar{b}(k)$ in the real networks and their dK-randomizations.
Betweenness $b(i)$ of node $i$ is a measure of how "important"
$i$ is in terms of the number of shortest paths passing
through it. Formally, if $\sigma_{st}(i)$ is the number of shortest
paths between nodes $s \neq i$ and $t \neq i$ that pass through
$i$, and $\sigma_{st}$ is the total number of shortest paths between
the two nodes $s \neq t$, then the betweenness of $i$ is

$$ b(i) = \sum_{s,t} \frac{\sigma_{st}(i)}{\sigma_{st}}. \tag{16} $$

Averaging over all nodes of degree $k$, degree-dependent
betweenness $\bar{b}(k)$ is

$$ \bar{b}(k) = \frac{\sum_{i:\,\deg(i)=k} b(i)}{N(k)}. \tag{17} $$

We also compute the betweenness distribution, and
show its average and standard deviation in Fig. 14.

We observe similar trends with respect to both distance
and betweenness metrics. The power grid cannot
be approximated even by its 3K-randomization. The Internet
lies at the other extreme: even 1K-randomization
does not disturb its global metrics too much. The air
transportation network appears to come next, as its
2K-randomizations resemble it closely. But all the networks
other than the power grid are very similar to their
3K-randomizations.



D. Scalar topological metrics and dK-randomness of real networks



To conclude this section we show in Table II the most
important scalar topological metrics for the real networks.
These metrics are coarse summary statistics of
the more informative and detailed metrics that we have
considered in this section. Specifically, these coarse
summaries are:

• $\bar{k}$ is the average degree in the network, Eq. (2),
which is both the 0K-distribution and a summary
statistic of the 1K-distribution in the dK-series
terminology;

• $r$ is the assortativity coefficient,

$$ r = \frac{\langle k \rangle^2 \sum_{k,k'} k k' P(k,k') - \langle k^2 \rangle^2}{\langle k^3 \rangle \langle k \rangle - \langle k^2 \rangle^2}, \tag{18} $$

which is nothing but the Pearson correlation coefficient
of the 2K-distribution $P(k,k')$;

• $\bar{c}$ is the average clustering,

$$ \bar{c} = \sum_k \bar{c}(k) P(k), \tag{19} $$

which is a coarse summary statistic of the 3K-distribution;

• $\bar{h}$ is the average distance, Eq. (15), which is
unrelated to dK-distributions;

• $\bar{b}$ is the average betweenness,

$$ \bar{b} = \sum_k \bar{b}(k) P(k), \tag{20} $$

unrelated to dK-distributions as well.



In Table II we also show the minimum value of d such that the
dK-randomization null model approximately reproduces
the real network with respect to all the metrics above.
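
The five scalar metrics listed above can be illustrated with a minimal Python sketch. It is not the code used to produce Table II. The adjacency-dictionary representation, the function names `bfs` and `scalar_metrics`, the convention that nodes of degree below 2 have zero clustering, and the choice to sum betweenness over unordered node pairs without normalization are all assumptions made for this illustration. The graph is assumed to be connected, simple, and to have at least one link.

```python
from collections import deque
from itertools import combinations


def bfs(adj, s):
    """Return (dist, sigma): hop distances from s and shortest-path counts from s."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:              # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:     # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma


def scalar_metrics(adj):
    """Average degree, assortativity, clustering, distance and betweenness.

    adj maps each node to the set of its neighbors (undirected, connected).
    """
    nodes = list(adj)
    n = len(nodes)
    deg = {v: len(adj[v]) for v in nodes}
    k_avg = sum(deg.values()) / n

    # Assortativity: Pearson correlation of the degrees found at the two
    # ends of a link, taken over both orientations of every link.
    ends = [(deg[u], deg[v]) for u in nodes for v in adj[u]]
    mean = sum(k for k, _ in ends) / len(ends)
    cov = sum((k - mean) * (kp - mean) for k, kp in ends) / len(ends)
    var = sum((k - mean) ** 2 for k, _ in ends) / len(ends)
    r = cov / var if var > 0 else float("nan")

    # Average clustering: mean over nodes of the fraction of linked
    # neighbor pairs (assumed 0 for nodes of degree below 2).
    def local_clustering(v):
        k = deg[v]
        if k < 2:
            return 0.0
        linked = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
        return 2 * linked / (k * (k - 1))

    c_avg = sum(local_clustering(v) for v in nodes) / n

    # All-pairs distances and shortest-path counts from one BFS per node.
    info = {s: bfs(adj, s) for s in nodes}
    dist = {s: info[s][0] for s in nodes}
    sigma = {s: info[s][1] for s in nodes}
    pairs = list(combinations(nodes, 2))
    h_avg = sum(dist[s][t] for s, t in pairs) / len(pairs)

    # Betweenness: node i lies on a shortest s-t path exactly when
    # dist(s,i) + dist(i,t) == dist(s,t); the number of such paths is
    # sigma(s,i) * sigma(i,t).
    def betweenness(i):
        total = 0.0
        for s, t in pairs:
            if i not in (s, t) and dist[s][i] + dist[i][t] == dist[s][t]:
                total += sigma[s][i] * sigma[i][t] / sigma[s][t]
        return total

    b_avg = sum(betweenness(i) for i in nodes) / n
    return {"k": k_avg, "r": r, "c": c_avg, "h": h_avg, "b": b_avg}


# Hypothetical toy network: a triangle (0, 1, 2) with a pendant node 3.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(scalar_metrics(example))
```

The betweenness step costs O(N^3) and is only meant for small graphs; for networks of the size considered here a faster algorithm such as Brandes' would be the usual choice.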

The observation that the power grid cannot be approximated
even by its 3K-randomization is instructive.
It shows that there are networks for which no
sufficiently small d preserves the network
structure upon dK-randomization. In the case of the power
grid, the reason why this network is not even 3K-random
may be that it is carefully
designed and fully controlled by human engineers.
Informally, we can think of it as rather "non-random":
it is designed, and thus bears a number of constraints that
dK-distributions with low d cannot capture. More generally,
the higher the d required to approximately preserve
the network structure upon dK-randomization, the less
"random" the network is. The commonly cited explanation
that the power grid is an "outlier" because it
is spatially embedded may be less relevant here, because
two other networks that we have considered (the Internet
and air transportation) are also spatially embedded.

What distinguishes the power grid from the other
networks considered is that the latter are self-evolving.
They may be engineered to a certain degree, as the
Internet is, but their global structure and evolution are not
fully controlled by any single human or organization. In
the Internet case, for example, the global network topology
is a cumulative effect of independent decisions made
by tens of thousands of separate organizations, roughly
corresponding to Autonomous Systems, i.e., nodes of the
Internet graph.

In that sense, self-evolving complex networks are
"more random." However, why the level of their
"randomness" is at d ≤ 3 remains an open question.
TABLE III: dK-series vs. d-series

d    dK-statistics                                  d-statistics
0    $\bar{k}$                                      -
1    $N(k)$                                         $N$
2    $N(k,k')$                                      $M$
3    $N_\wedge(k,k',k'')$, $N_\triangle(k,k',k'')$  $W$, $T$



V. MOTIF-BASED SERIES VS. dK-SERIES

In this section we compare the dK-series with the series
based on motifs, and show that the latter cannot form a
systematic basis for topology analysis.

The difference between the dK-series and the motif series,
which we can call the d-series, is that the former is the series
of distributions of d-sized subgraphs labeled with node
degrees in a given network, while the d-series is the series of
distributions of such subgraphs in which this degree
information is ignored. This difference explains the mnemonic
names of the two series: 'd' in 'dK' refers to the
subgraph size, while 'K' signifies that the subgraphs are labeled by
node degrees ('K' is a standard notation for node degrees).

This difference between the dK-series and the d-series is
crucial. The dK-series is inclusive, in the sense that the
(d+1)K-distribution contains the full information about
the dK-distribution, plus some additional information,
which is not true for the d-series.
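
As a small illustration of inclusiveness for d = 1, the following Python sketch recovers the 1K-statistics N(k) from the 2K-statistics N(k,k') alone. It is an illustration under stated assumptions, not the derivation of Section IV A: the edge-list input, the function names, and the convention that N(k,k') counts each link once with k ≤ k' are choices made here. The graph is assumed to be simple and to have no isolated nodes.

```python
from collections import Counter


def degrees(edges):
    """Node degrees of the simple undirected graph given by an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def one_k(edges):
    """1K-statistics N(k): number of nodes of degree k."""
    return Counter(degrees(edges).values())


def two_k(edges):
    """2K-statistics N(k, k') with k <= k': links joining degrees k and k'."""
    deg = degrees(edges)
    return Counter(tuple(sorted((deg[u], deg[v]))) for u, v in edges)


def one_k_from_two_k(n2k):
    """Recover N(k) from N(k, k') only.

    Every node of degree k owns exactly k link ends, so the number of
    link ends attached to degree-k nodes equals k * N(k).
    """
    link_ends = Counter()
    for (k, kp), count in n2k.items():
        link_ends[k] += count
        link_ends[kp] += count   # a (k, k) link adds two ends of degree k
    return Counter({k: e // k for k, e in link_ends.items()})


# Hypothetical toy network: a triangle (0, 1, 2) with a pendant node 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
print(one_k(edges) == one_k_from_two_k(two_k(edges)))
```

No analogous function exists for the d-series: the link count M alone carries no information from which the node count N could be computed.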

To see why the d-series is not inclusive, consider the first few
elements of both series in Table III. In Section IV A we show
explicitly how the (d+1)K-distributions define the dK-distributions for
d = 0, 1, 2. The key observation is that the d-series does
not have this property. The 0th element of the d-series is
undefined. For d = 1 we have the number of subgraphs
of size 1, which is just N, the number of nodes in the
network. For d = 2, the corresponding statistic is M,
the number of links, i.e., subgraphs of size 2. Clearly, M and
N are independent statistics, and the former does not
define the latter. For d = 3, the statistics are W and T,
the total numbers of wedges and triangles, i.e., the subgraphs of
size 3, in the network. These do not define the previous
element M either. Indeed, consider the following two
networks of size N, the chain and the star:




There are no triangles in either network, $T = 0$. In the
chain network, the number of wedges is $W = N - 2$, and
in the star $W = (N-1)(N-2)/2$. We see that even
though $W$ ($d = 3$) scales completely differently with $N$
in the two networks, the number of links $M = N - 1$
($d = 2$) is the same.
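
These counts are easy to check numerically. The Python sketch below builds a chain and a star on the same number of nodes and counts links, wedges and triangles. The helper names `chain`, `star` and `motif_counts` and the adjacency-dictionary representation are assumptions made for this illustration, and a wedge is taken here to be a 3-node path whose end nodes are not linked.

```python
from itertools import combinations


def chain(n):
    """Path graph on nodes 0..n-1 as a dict of neighbor sets."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}


def star(n):
    """Star graph: node 0 is the hub, nodes 1..n-1 are leaves."""
    adj = {0: set(range(1, n))}
    adj.update({i: {0} for i in range(1, n)})
    return adj


def motif_counts(adj):
    """Return (M, W, T): numbers of links, wedges and triangles."""
    m = sum(len(nb) for nb in adj.values()) // 2
    open_pairs = 0     # neighbor pairs of a node that are not linked: wedges
    closed_pairs = 0   # linked neighbor pairs: each triangle is seen 3 times
    for nb in adj.values():
        for a, b in combinations(nb, 2):
            if b in adj[a]:
                closed_pairs += 1
            else:
                open_pairs += 1
    return m, open_pairs, closed_pairs // 3


print(motif_counts(chain(10)))  # (9, 8, 0):  M = N - 1, W = N - 2
print(motif_counts(star(10)))   # (9, 36, 0): M = N - 1, W = (N - 1)(N - 2) / 2
```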

In summary, the d-series is not inclusive. For each d,
the corresponding element of the series reflects a different
kind of statistical information about the network
topology, unrelated or only loosely related to the
information conveyed by the preceding elements. At the
same time, similar to the dK-series, the d-series also
converges, since at $d = N$ it specifies the whole network
topology. However, this convergence is much slower
than in the dK-series case. In the two networks
considered above, for example, neither $W = N - 2$, $T = 0$
nor $W = (N-1)(N-2)/2$, $T = 0$ fixes the network
topology, as there are many non-isomorphic graphs with
the same $(W, T)$ counts, whereas the 3K-distributions
$N_\wedge(1,2,2) = 2$, $N_\wedge(2,2,2) = N - 4$ and
$N_\wedge(1, N-1, 1) = (N-1)(N-2)/2$ define the chain and
star topologies exactly.
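
The degree-labeled wedge counts quoted above can be reproduced with a short Python sketch. It only computes the wedge part of the 3K-statistics and does not by itself prove that these counts fix the topology. The function names and the key convention (end degree, center degree, end degree), with the two end degrees sorted, are assumptions chosen to match the argument order used in the text.

```python
from collections import Counter
from itertools import combinations


def chain(n):
    """Path graph on nodes 0..n-1 as a dict of neighbor sets."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}


def star(n):
    """Star graph: node 0 is the hub, nodes 1..n-1 are leaves."""
    adj = {0: set(range(1, n))}
    adj.update({i: {0} for i in range(1, n)})
    return adj


def wedge_distribution(adj):
    """Count wedges by the degrees of their nodes.

    Keys are (k, k', k'') with k' the degree of the center node and
    k <= k'' the degrees of the two unlinked end nodes.
    """
    deg = {v: len(nb) for v, nb in adj.items()}
    counts = Counter()
    for center, nb in adj.items():
        for a, b in combinations(nb, 2):
            if b not in adj[a]:   # open wedge, not part of a triangle
                low, high = sorted((deg[a], deg[b]))
                counts[(low, deg[center], high)] += 1
    return counts


n = 10
print(wedge_distribution(chain(n)))  # 2 wedges (1,2,2) and n - 4 wedges (2,2,2)
print(wedge_distribution(star(n)))   # (n - 1)(n - 2)/2 wedges (1, n-1, 1)
```

Unlike the plain pair (W, T), these labeled counts distinguish the two networks at a glance: the chain has only degree-2 centers, while every wedge of the star shares a single center of degree N - 1.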

The node degrees thus provide necessary information
about subgraph locations in the original network, which
improves convergence, and makes the dK-series basis
inclusive and systematic.